1 Overview

These are plots and searchable, sortable datatables to accompany the main text. They are divided into unigram, bigram, and trigram sections. The unigram section considers the occurrences of segments by syllabic position. The bigram and trigram sections consider (not necessarily sequential) co-occurrences of two or three segments, respectively.

For some n-grams, data are plotted as histograms or heatmaps. These are hidden by default, but can be revealed (and hidden again) by clicking the appropriate button.

In all tables, the data can be copied to the clipboard or saved as a CSV file. There are also additional columns that are hidden by default; these can be revealed using the Show all button, or individual columns can be selected using the Column visibility button. Columns can also be reordered by clicking and dragging. The datatables can be sorted by individual columns; the default is descending order by O/E ratio.

Most columns in the unigram, bigram, and trigram sections have the same interpretation:

  • the SV and NSV columns give the number of occurrences (counts) of an n-gram in a given layer.
  • the SV/total column shows the percentage (out of 100) of the n-grams that occur in the SV layer.
  • the O/E column gives the observed/expected ratio for this n-gram, where the expectation is based on frequencies of occurrence in the two layers:

\[\text{O/E} = \frac{\text{Count in SV list}}{\text{Count in both lists}} \times \frac{\text{Length of lexicon}}{\text{Length of SV list}}\]

Since the lengths of the lexicon and the SV list are both constant, O/E is directly proportional to the share of occurrences in the SV layer (here, SV/total expressed as a proportion \(\times\) 4.234141).

The advantage of the O/E ratio lies in its interpretability: when O/E \(\approx\) 1, the n-gram occurs in the SV list about as often as expected, i.e. about 24% of the time (SV/total \(\approx\) 0.24); when O/E < 1, the n-gram occurs less often than expected; and when O/E > 1, it occurs more often than expected.
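The O/E formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the tables: the function name `oe_ratio`, its parameter names, and the list sizes in the usage example are all hypothetical, and it assumes that "count in both lists" is simply the SV count plus the NSV count.

```python
def oe_ratio(sv_count, nsv_count, lexicon_len, sv_len):
    """Observed/expected ratio for one n-gram.

    Assumption: the count in both lists is sv_count + nsv_count.
    All names here are illustrative, not taken from the original analysis.
    """
    total = sv_count + nsv_count
    if total == 0:
        raise ValueError("n-gram does not occur in either list")
    observed_share = sv_count / total      # SV/total, as a proportion
    expected_share = sv_len / lexicon_len  # share of the lexicon that is SV
    return observed_share / expected_share


# Hypothetical sizes: a 1,000-item lexicon of which 250 items are SV.
example = oe_ratio(sv_count=30, nsv_count=70, lexicon_len=1000, sv_len=250)
```

With these made-up numbers the n-gram is SV in 30% of its occurrences against an expected 25%, so the ratio comes out a little above 1. With the constant reported above, the expected share is 1/4.234141, or about 0.236.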

2 Unigrams

  • The unigram tables contain two hidden columns %SV and %NSV, which are the list-specific positional frequencies of each segment. They are not true unigram frequencies. For example, in the Onsets table, k has a %SV of 8.72%; this means that 8.72% of syllables in the SV list begin with k. We treat the onset as obligatory, e.g. oan is treated as having the onset ʔ. The %SV and %NSV columns are hidden by default; use the Show all or Column visibility buttons to reveal them.

  • The histograms plot each segment’s percentage of the total segments of that type in each layer, ordered by %SV. The histograms are also hidden by default.
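The positional frequencies described above amount to a per-position tally. The following is a minimal sketch under the assumption that each syllable has already been parsed and that the onset (including ʔ for orthographically vowel-initial syllables) is available as a string; the function name and the toy data are hypothetical.

```python
from collections import Counter


def positional_percentages(segments):
    """Percentage of syllables in one list that carry each segment
    in a given position (e.g. the onset)."""
    counts = Counter(segments)
    total = len(segments)
    return {seg: 100 * n / total for seg, n in counts.items()}


# Toy onsets for four hypothetical SV syllables (not real data).
toy_onsets = ["k", "k", "t", "ʔ"]
toy_pct = positional_percentages(toy_onsets)
```

Here `toy_pct["k"]` is 50.0 because two of the four toy syllables begin with k; in the real Onsets table the corresponding %SV value for k is 8.72%.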

2.1 Onsets

2.2 Medials

2.3 Nuclei

2.4 Codas

2.5 Tones, live syllables

2.6 Tones, dead syllables

3 Bigrams

  • The bigram tables contain two columns PMI_SV and PMI_NSV, which give the pointwise mutual information scores for the segment pair in the relevant list. PMI describes the increase or decrease in the cost of describing a segment in a particular environment. Positive PMI for a sequence AB in list L means that when we observe segment A, we are unsurprised to find that segment B occurs after it, whereas negative PMI means that we are more surprised to see B, given that we’ve seen A. PMI is colored green when it exceeds 0.25 and red when it is less than -0.25, but there is nothing inherently special about these values. These columns are hidden by default; use the Show all or Column visibility buttons to reveal them.

  • Again, all segments are treated as “positionally specific”. That is, final -k and onset k are not the same k for purposes of determining frequencies (and therefore pointwise mutual information). This is partly because what we are interested in is the positional stickiness, and partly because they are arguably different (phonetic) segments.

  • The heatmaps indicate the number of occurrences of a bigram in a given layer. Hover over a cell in the heatmaps to see the exact count of bigrams for that cell. In the heatmaps only, bigrams with n=1 are not shown.
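For readers unfamiliar with PMI, the sketch below shows one common way to compute it for positionally specific segment pairs within a single list. It is an illustration under stated assumptions, not the procedure used for the tables: the logarithm base (2), the function names, and the toy data are all assumptions, and the coloring helper simply mirrors the ±0.25 thresholds mentioned above.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """PMI for each (first, second) pair, e.g. (onset, nucleus).

    Marginals are counted per position, so an onset k and a coda k
    would never be pooled. Base-2 logarithm is an assumption.
    """
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)
    return {
        (a, b): math.log2((c / n) / ((first_counts[a] / n) * (second_counts[b] / n)))
        for (a, b), c in pair_counts.items()
    }


def pmi_color(value, threshold=0.25):
    """Cell color following the thresholds described in the text."""
    if value > threshold:
        return "green"
    if value < -threshold:
        return "red"
    return "none"


# Toy onset-nucleus pairs (hypothetical, not drawn from the lexicon).
toy_pairs = [("k", "a"), ("k", "a"), ("t", "i"), ("t", "a")]
toy_pmi = pmi_table(toy_pairs)
```

In this toy list, i occurs only after t, so the pair (t, i) gets a positive score, while (t, a) comes out negative because a is mostly found after k.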

3.1 Onset-nucleus

3.2 Onset-medial

3.3 Onset-coda

3.4 Onset-tone

3.5 Nucleus-coda

3.6 Coda-tone

4 Trigrams

4.1 Onset, medial, nucleus

4.2 Onset, medial, coda

4.3 Medial, nucleus, coda

4.4 Nucleus, coda, tone

5 Syllable structure

  • possible is the count of possible syllables of this shape. What counts as a “possible” syllable? There are different ways to define this; here we assume:

    • 24 “plain” onsets (including ʔ but excluding w; we distinguish orthographic d gi in addition to s x)
    • 12 nuclei [aː e əː ɛ i ɨ ɔ o u iə ɨə uə] with unrestricted distribution following plain onsets
    • 2 nuclei [a ə] that cannot occur in open syllables
    • 17 “labializable” onsets [ɗ t tʰ s z l r c ʂ ɲ ʈ k x ɣ ŋ h ʔ] (we treat onset w here like a labialized ʔ for co-occurrence reasons) which may not be followed by [ɨ ɔ o u ɨə uə] (ostensibly the single exception is quốc but it is typically pronounced [kwək])
    • 3 nasal codas [m n ŋ] and 3 unreleased plosive codas [p t k]
    • 2 semivowels [w j] with restricted distribution: [j] cannot follow [i iə e ɛ] and [w] cannot follow [əː ɔ o u uə]
    • a “null” coda that can only occur with 12 of the 14 nuclei
    • 6 tones that can occur with sonorant or null finals
    • 2 tones that can occur with obstruent codas
  • SV and NSV are the counts of syllables of these shapes in the SV and NSV lists, respectively

  • %SV and %NSV are the percentages of the possible number of syllables of this shape that occur in the SV or NSV lists, respectively. %possible is simply the sum of %SV and %NSV.
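The assumptions listed above can be turned into a direct enumeration. The sketch below is one reading of those assumptions, not the original calculation: it treats the semivowels [w j] as sonorant codas taking all 6 tones, takes the 2 nuclei barred from the null coda to be [a ə], and counts onsets only by number (24 plain, 17 labialized). All names are hypothetical.

```python
NUCLEI_FREE = ["aː", "e", "əː", "ɛ", "i", "ɨ", "ɔ", "o", "u", "iə", "ɨə", "uə"]
NUCLEI_CLOSED_ONLY = ["a", "ə"]                       # cannot take the null coda
NOT_AFTER_LABIALIZED = {"ɨ", "ɔ", "o", "u", "ɨə", "uə"}
NASAL_CODAS = ["m", "n", "ŋ"]
STOP_CODAS = ["p", "t", "k"]
NO_J_AFTER = {"i", "iə", "e", "ɛ"}
NO_W_AFTER = {"əː", "ɔ", "o", "u", "uə"}
N_PLAIN_ONSETS = 24
N_LABIALIZED_ONSETS = 17
ALL_CODAS = [""] + NASAL_CODAS + STOP_CODAS + ["j", "w"]   # "" is the null coda


def coda_allowed(nucleus, coda):
    """Apply the nucleus-coda restrictions listed in the text."""
    if coda == "":
        return nucleus not in NUCLEI_CLOSED_ONLY
    if coda == "j":
        return nucleus not in NO_J_AFTER
    if coda == "w":
        return nucleus not in NO_W_AFTER
    return True


def toned_rimes(labialized):
    """Number of nucleus + coda + tone combinations for one onset type."""
    total = 0
    for nucleus in NUCLEI_FREE + NUCLEI_CLOSED_ONLY:
        if labialized and nucleus in NOT_AFTER_LABIALIZED:
            continue
        for coda in ALL_CODAS:
            if coda_allowed(nucleus, coda):
                # 2 tones with obstruent codas, 6 with sonorant or null finals
                total += 2 if coda in STOP_CODAS else 6
    return total


possible_plain = N_PLAIN_ONSETS * toned_rimes(labialized=False)
possible_labialized = N_LABIALIZED_ONSETS * toned_rimes(labialized=True)
possible_total = possible_plain + possible_labialized
print(possible_total)  # 17526 under this reading of the assumptions
```

Under this reading the total is in line with the figure of about 17,500 possible syllables cited in the takeaways; a different treatment of any single restriction would shift the count.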

5.1 Possible and attested syllables

Takeaways:

  • Out of about 17,500 possible syllables, roughly half are attested, and of that half, about 25% are SV.
  • The distribution of attested syllables relative to possible syllables is extremely uneven. For instance, out of all possible CV sequences (including tones), nearly 80% are attested, while only about half of all possible CVN sequences are.
  • Only around 5% of attested syllables have a Cw- onset, compared to 30% of possible syllables (as calculated here). Thus, it may be more accurate to state that, generally speaking, Vietnamese makes use of almost the entirety of the space of possible CV syllables, but only about half the possible space of C(C)VC syllables.

5.2 Canonical syllable shape

Trần & Vallée 2009 report that “the prevalent monosyllabic pattern in Vietnamese…was the CVC syllable type, respectively 70% and 34% of the monosyllabic words, and respectively 70% and 20% of the language syllable inventory” (2009:232). Their counts were derived from a list of words with frequency above 2% in a 5,000-word lexicon. If we collapse the above table into their three categories (CV, CVC, CCVC), we see that the numbers are quite close: about 21% C(C)V, 71% CVC, and 8% CCVC.